# Classic CNN for Handwritten Digits

Richard Lin Jianxiong Xu Yifeng Zhang Genwen Zhao

## **Project Overview**

Classic CNN handwritten digit recognizer implemented on Nexys board. Able to classify user handwritten digit displayed in front of an HDMI camera.

#### Why FPGA?

 High computational capability on floating point matrix operation. Great for CNN application. Configurable HW for fast development & testing.

#### Main requirements:

- HLS to develop CNN core with FC, CONV and Max Pooling layers
- HDMI video IPs and Microblaze processor for data capture and control
- External DDR for main data storage

# Design Environment

Hardware Platform - Nexys Video board - Artix-7 XC7A200T-1SBG484C

Software Tools - 2017.2 Vivado, HLS, & SDK. Python 3.7.3

Revision Control - Initially Dropbox and Google Drive. Github for submission.

- Design is partitioned into HLS CNN Core (Genwen & Yifeng) or HDMI System (Jianxiong & Richard)
- Both sections are individually tested & demoed first
- Integrate & review design files through in-person team meetings

## Overall System architecture - HDMI

- HDMI Input Datapath
  - Convert TMDS/DVI to AXI-4 Stream RGB Data
  - Store this 1920x1080 image in DDR memory
- Process image in SW, run CNN, write image of detected number to DDR (described in next slides)
- HDMI Output Datapath
  - Access DDR for above image output
  - Convert from AXI-4 Stream
    RGB Data to TMDS/DVI



## Overall System architecture - HLS Core

- CNN core 1 of each in HW
  - Convolutional layer
  - Max pooling layer
  - Fully connected layer
- Software sends data through CONV, POOL, CONV, POOL, FC, FC
- Microblaze provides a processed image (28x28, black and white)
- Outputs final results to Microblaze



## Overall System architecture - HLS Core

- Trained LeNet CNN Architecture with MNIST database (60,000 training examples) on Caffe framework
- Achieved 99.05% accuray with the test data (10,000 test images) in Caffe framework





## Overall System architecture - Microblaze

- CPU to schedule tasks, initialize HW, and manage UART
- Convert HDMI Image:
  - RGB to B/W
  - 1920x1080 to 28x28
- Pass processed image and enable HLS
  CNN Core layers successively
- Use softmax to process HLS CNN core output. Write final recognized number to HDMI display and sent probabilities of each number over UART



## Verification & Testing of the IP Modules

#### IP level:

- Each HLS/IP module has its own test bench, and verification was done individually before system integration.
  - Hardware modules
    - HDMI image capture block
    - HLS CNN core
    - Microblaze
    - DDR access
  - Software modules
    - Image compression engine and preprocessing function
    - CNN core drivers
- Defined clear I/O interface for each module, and data/control paths between
  IPs based on our top level system diagram.

## Verification & Testing of the System

#### System level:

- Our system level testing strategy is to integrate one block at a time
  - First test the Microblaze and DDR access part together
  - Second add the HLS core to the system
  - Third integrate the image compression engine into the system
  - Last add the HDMI module to perform the whole system testing
- Our integration and testing strategy helps to reduce the complexity and time spent on debugging

## Overall HLS experience:

#### HLS in project:

- Main CNN Core is implemented using HLS.
- Develop CNN in algorithmic level using C/C++, save 60 70% of time compared to traditional HDL

#### Advantage:

- Easy to implement complex algorithm in hardware using high level programming language, without taking care of low level hardware interface
- Fast turnaround time for hardware changes. Minimal change in main algorithm, use HLS
  Pragma to generate different hardware architectures for specific requirements

# Overall HLS experience - cont'

#### Disadvantage:

 Unpredictable hardware from HLS Pragma - Pipeline & unrolling can generate unnecessary hardware for low efficient data parallelism due to other coding limitation

Timing sensitive logic, FSM control logic are still easier to implement in HDL for more predictable hardware

- Debugging HLS generated hardware can be challenging compared to HDL

### Project Schedule and Accomplishment

Milestone 1: Research and Setup Phase - March 20th (Completed)

- Richard & Jianxiong researched for Vivado Project, board IO, memory
- Yifeng & Genwen researched on CNN optimization technique and exiting CNN training methodology
- Overall system architecture, design partition and interface definition

#### Milestone 2: Development Phase - April 15th (Completed)

- Successfully bring up HDMI video streaming (direct Video Input to output on monitor)
- Successfully bring up CNN core and verify functionality in IP level
- Successfully bring up LeNet training model and obtain trained parameter sets

#### Milestone 3: Integration and Test phase - April 30th(Completed)

- Successfully package and Integrate CNN core to Nexys board with HDMI video IPs
- Developed a full working CNN application in Xilinx SDK flow- including CNN core drivers, HDMI image capture and process functions
- Able to achieve >90% classification accuracy on HDMI camera captured handwritten digits!

## Project Demo and Q/A

Github Link: <a href="https://github.com/richardlin23/ECE1373\_CNN\_Digits">https://github.com/richardlin23/ECE1373\_CNN\_Digits</a>

Demo with Camera Input: <a href="https://youtu.be/ZbqtS1Rtxsl">https://youtu.be/ZbqtS1Rtxsl</a>

Demo with Computer Input: <a href="https://youtu.be/WsqfhlvcGVA">https://youtu.be/WsqfhlvcGVA</a>

Questions for the team?

iter a selection:3start machine learning

Machine Learning Dem

